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ABSTRACT 

This paper briefly discusses four measurement challenges of data science or ‘big data’ in educational assessments that are 
enabled by technology: 1. Dealing with change over time via time-based data. 2. How a digital performance space’s 
relationships interact with learner actions, communications and products. 3. How layers of interpretation are formed from 
translations of atomistic data into meaningful larger units suitable for making inferences about what someone knows and 
can do. 4. How to represent the dynamics of interactions between and among learners who are being assessed by their 
interactions with each other as well as with digital resources and agents in digital performance spaces. Because of the 
movement from paper-based tests to online learning, and in order to make progress on these challenges, the authors 
advocate the restructuring of training of the next generation of researchers and psychometricians to specialize in data 
science in technology enabled assessments. This call to action stemmed from discussions at EDUsummlT 2013, which 
will be published in depth in a special issue of Education and Information Technologies. 
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1. INTRODUCTION 

Assessment and learning analytics challenges have dramatically increased since new digital performance 
affordances, user interfaces, and the targets of technology-enabled assessments have become more complex. 
The increased complexity is due in part to technology’s capabilities and roles in presenting interactive 
learning experiences and collecting rich data (de Freitas, 2014; Quellmalz et al., 2012) which is leading to the 
infusion of data science methods and techniques into learning and behavioural science research (Gibson & 
Knezek, 2011; Kozleski, Gibson, & Hynds, 2012). These changes require new quantitative methods as well 
as a reconceptualization of mixed methods (Tashakkori & Teddlie, 2003) that include domain experts as well 
as stakeholders in the construction of knowledge of such complex systems. 

In technology-enhanced assessments, the emergence of ‘big data’ - which are defined as data with a large 
numbers of records, of widely differing data types, that are rapidly collected for immediate action (IBM, 
2015; Margetts & Sutcliffe, 2013) - underscores the need to develop assessment literacy (Stiggins, 1995) in 
teachers, learners and other audiences of assessment. Assessment literacy has become more important than 
ever for understanding how technology influences and impacts assessment types and processes and especially 
for developing confidence in creating and analysing arguments from evidence, based on a user’s current 
understanding of validation (Black, Harrison, Hodgen, Marshall, & Serret, 2010). 

The paper discusses the main challenges associated with applying data science methods in educational 
assessment to address a digital media assessment’s psychometric properties; time sensitivity; digital 
performance and the problem space for analysis; the hierarchy of tasks, turns and translations between 
different levels and the dynamics of interrelationships in assessment systems. The OECD PISA plan for 
assessing CPS is used as an example to explain these challenges in relation to a complex problem space. The 
paper then illustrates with a learning analytics case that shows how the identified challenges have been 
addressed in the development of assessments. 
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2. BACKGROUND 

There is uncertainty as to whether and how four different perspectives on assessment - providing feedback 
information, supporting improvement decisions, identifying the degree of engagement and understanding, 
and making value judgments - can co-exist to the benefit of learners (Webb, Gibson, & Forkosh-Baruch, 
2013). Even with the increased possibilities that IT provides there is not yet a way to say confidently that the 
multiple purposes for which some assessments have been used (Mansell, James, & the Assessment Reform 
Group, 2009) can or should be supported through the same assessment systems. This is because the impacts 
of some purposes interact with the validation processes of others (Messick, 1994). Therefore in considering 
assessment design for multiple purposes for example for formative as well as summative purposes, users 
need to examine those impacts carefully in order to minimise negative consequences on learning and 
learners. 

Developing theory for the application of data science methods in educational research is important for 
two primary reasons. First, assessment of virtual performance presents new challenges for psychometrics 
(Clarke-Midura & Dede, 2010; Ifenthaler, Eseryel, & Ge, 2012; Quellmalz et ak, 2012). Secondly, new tools 
are needed for discovery of patterns and drivers in complex systems for working with ‘big data’ in 
educational research preparation and practice (Gibson, 2012; Patton, 2011). Indicators of progress in theory 
development would be an increase in research exploring and articulating the use of data science methods in 
learning analytics to improve learning and achievement; and the expansion of methods beyond traditional 
statistics and qualitative approaches in educational research, to include data mining, machine learning, and in 
general, the methods of data science. 


3. PSYCHOMETRIC CHALLENGES IN THE ERA OE BIG DATA 

Psychometrics is the branch of psychology that deals with the design, administration, and interpretation of 
quantitative tests for the measurement of psychological variables such as intelligence, aptitude, and 
personality traits (“Psychometrics,” 2014). A good psychometric test is “internally consistent, reliable over 
time, discriminating and of demonstrated validity in respect of its correlations with other tests, its predictive 
power and the performance of various criterion groups. It also has good norms” (Kline, 1998, p.92). 

Until recently, the field dealt almost exclusively with the construction and validation of measurement 
instruments such as questionnaires, tests, and personality assessments. However there is now a need to 
expand to include highly interactive digital learning and adaptive test experiences, such as the OECD PISA 
assessment of CPS. In brief, PISA is a triennial international survey that aims to evaluate education systems 
worldwide by testing the skills and knowledge of 15 -year-old students in order to determine the extent to 
which they can apply their knowledge to real-life situations and hence are prepared for full participation in 
society. To constrain the quite complex variables that would be involved if the collaboration was among a set 
of real people, the OECD assessment utilizes the computer to play roles as collaborators in a virtual 
performance assessment (Clarke-Midura, Code, Dede, Mayrath, & Zap, 2012). The PISA assessment plan 
incorporates a complex behaviour space that illustrates some of the new demands on psychometrics. 

The challenge with technology enabled assessments is to evolve the procedural foundations of 
psychometrics, which until recently have been primarily based on population statistics and static snapshots of 
data. The new foundation outlined here highlights the need to include time sensitivity, digital performance 
space relationships, multiple layers of aggregations at different scales, and representations of the dynamics of 
a complex behaviour space (Gibson & Jakl, 2013; Quellmalz et ak, 2012). 

3.1 Time Sensitivity 

In the OECD assessment, time is controlled as a boundary variable of the test and the computer is used to 
prompt the test taker to ‘move on’ when the evidence rule system detects that the student needs to be rescued 
from an unproductive problem-solving path. The decision to redirect appears natural to the situation because 
the computer is playing the role of one or more collaborators, so the suggestion to move on comes from a 
simulated peer. This situation illustrates that a technology-enabled assessment might well give the student 
perceived or actual control over time, compared to an assessment that only displays test item prompts in a 
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timed test. In some virtual performance assessments, time is open-ended, and the use of item resources (e.g. 
in what order, with or without returning to the resources multiple times, time spent with each resource, timing 
of the appropriate use of a resource, and total time to utilize the appropriate resources to accomplish the task) 
all may be critical to the classification of the learner’s response (Gibson & Jakl, 2013; Stevens & 
Palacio-Cayetano, 2003). 

The OECD assessment solves the time sensitivity problem by parsing time into critical events and then 
monitoring the event patterns to detect the level of evidence of the competencies in the domain model (see 
Table 1). This is a form of time segmentation, because some events cannot happen until other events have 
occurred (e.g. establishing and maintaining team organisation must occur after establishing a shared vision, 
and while maintaining that vision and taking appropriate action to solve the problem). A planned sequence of 
activities and timed release of testing resources, known in game -based learning as a ‘branching storyline’ 
(Aldrich, 2005) is a method for controlling the evolution of a process. 

Table 1. Domain model for assessing collaborative problem solving 



(1) Establishing and 
maintaining shared 
understanding 

(2) Taking appropriate 
action to solve the problem 

(3) Establishing and 
maintaining team 
organisation 

(A) Exploring and 
Understanding 

(At) Discovering 
perspectives and abiiities of 
team members 

(A2) Discovering the type of 
coilaborative interaction to 
soive the probiem, along with 
goals 

(A3) Understanding roles to 
solve problem 

(B) Representing and 
Formulating 

(B1) Buiiding a shared 
representation and 
negotiating the meaning of 
the probiem (common 
ground) 

(B2) Identifying and 
describing tasks to be 
completed 

(B3) Describe roles and team 
organisation (communication 
protocol/rules of engagement) 

(C) Planning and 
Executing 

(Cl) Communicating with 
team members about the 
actions to be/ being 
performed 

(C2) Enacting plans 

(C3) Following rules of 
engagement, (e.g., prompting 
other team members to 
perform their tasks.) 

(D) Monitoring and 
Reflecting 

(D1) Monitoring and 
repairing the shared 
understanding 

(D2) Monitoring results of 
actions and evaluating 
success in solving the 
problem 

(D3) Monitoring, providing 
feedback and adapting the 
team organisation and roles 


Other problem-solving contexts, such as coordination of group actions needed for scientific inquiry and 
experimentation, require simultaneous actions mixed with sequences of actions. The classification system of 
the assessment has to handle patterns of simultaneous and sequential interactions in order to make valid 
links to time-sensitive evidence rules within the conceptual assessment framework (CAP), which is a key 
component of evidence centred design (Mislevy, Steinberg, & Almond, 1999), an approach that is becoming 
increasingly prominent in assessment design and on which this analysis is based. The CAP has three core 
components: the student model, task model and evidence model (Mislevy et ak, 1999; Mislevy, Steinberg, & 
Almond, 2003) within and among which the time sensitive relationships adhere. 


3.2 Digital Performance Space Relationships 

A learning experience entails a designed structure of knowledge and action (Jonassen, 1997) and when that 
experience is interactive and digital there are many measurement challenges (Quellmalz et ak, 2012). The 
emerging varieties of network analysis (e.g. social networks, visualization, artificial neural networks, 
decision ttees) have arisen as new analytical tools and methods for understanding the sttuctural relationships 
in technology-enhanced learning (Choi, Rupp, Gushta, & Sweet, 2010; Shaffer et ak, 2009). In addition, the 
Uaces of knowledge and action (i.e. the actions, communications and products) created by a learner during 
the course of interacting with a digital learning application bear a relationship to that person’s mental 
representations of the problem (Newell & Simon, 1972) and the knowledge and capability they acquired. 
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accessed and utilized during the interaction (Pirnay-Dummer, Ifenthaler, & Spector, 2010; Thagard, 2010). 
This set of ideas are referred to here as "digital performance space relationships’ which will be shown to be 
similar to ‘items’ and ‘constructs’ in classical test theory. 

An interactive digital performance space can support several scenarios, each with one or more 
classification challenges for inferring what the test taker knows and can do. In the OECD assessment, for 
example, the scenarios presented to the student are designed to sample the digital performance space 
construct of ‘collaborative problem solving.’ Each scenario allows the classification of the test taker into one 
or more cells of a matrix created by the intersection of three stages of ‘collaboration’ with four stages of 
‘problem-solving’ (Table 1). In classical test theory, the ‘construct’ plays a similar role to the digital 
performance space; several test items are used to make multiple measures of the construct. A review of the 
historical idea of a valid construct is helpful for making the bridge from classical testing to the digital age. 

A valid construct was thought of as an inductive summary and as part of a series of validity investigations 
that included concurrent, predictive and content considerations. In addition, the construct can change and 
become more elaborated over time, as Cronbach noted (Cronbach & Meehl, 1955): 

When a construct is fairly new, there may be few specifiable associations by which to pin down the 
concept. As research proceeds, the construct sends out roots in many directions, which attach it to more and 
more facts or other constructs. 

Einally, the construct acquired validity through the idea of a nomological network which is a collection of 
overlapping mappings from (a) observable properties or quantities to one another; (b) different theoretical 
ideas to one another, or (c) theoretical constructs to observables (ibid). A single mapping might include 
examples of all these relations, as a construct might be a complex set of factors that interact with one another. 
The idea of a network of ideas and relationships was a fairly abstract philosophical idea in the 1950’s but 
today has a concrete meaning that has become known as network theory in social science (Borgatti & Halgin, 
2011) and network analysis in computational sciences, both of which are applied graph theory from 
mathematics (Brandes & Erlebach, 2005). This history outlines a bridge of ideas that carries forward into 
today when digital media learning spaces can record a network of traces of the actions of a learner. 

Digital media learning presents problems as well as prompts for learner performance (e.g. 
problem-solving, collaboration) in a space that is characterized by hyperlinked resources that can be 
represented as nodes and relations in a network (Clarke-Midura et ak, 2012; Quellmalz et ak, 2012; Stevens, 
2006). As a learner uses such a space to learn and perform (e.g. interacting with the resources to solve a 
problem, adding new information, re-arranging resources into new relationships) a new network can be 
created that represents the learner’s response, a time-specific performance path through the digital 
performance space (Ifenthaler et ak, 2012). The learner’s performance network is a constructed knowledge 
structure that needs to be taken into account in assessment (Gijbels, Dochy, Van den Bossche, & Segers, 
2005). The digital performance space and the constructed knowledge structure of the learner hold the same 
kind of relationship as the nomological network does to a demonstrated construct; the digital performance 
space holds the learning designer’s view of the construct (e.g. what it means to act like a scientist in a given 
situation) and the constructed knowledge structure (e.g. what the learner did in this instance) holds evidence 
of the processes and products of knowing and doing. 

The terms of the nomological network inference, which underpins a claim of construct validity, bear a 
similarity to the rules of a chain of a reasoned argument, which can lead to a claim concerning what a learner 
knows and can do as used in Evidence-Centered Design (ECD). In ECD, an argument has constituent claims, 
data, warrants and backing and must take account of alternative explanations. In a nomological network by 
comparison, there are observations, ideas and relationships and a chain of inference must be used in order to 
establish a claim that a particular test is a measure of the construct. 

The relationships and nodes of a network representation of the traces of learner interactions can be 
compared to the digital performance space resources and relationships to enable inferences about what the 
learner knows and can do (Al-diban & Ifenthaler, 2011; Quellmalz, Timms, & Schneider, 2009). Network 
measures such as similarity, centrality, clusters and pattern matching are used in such inferences, where the 
patterns of the network imply functional and structural connectivity (Sporns, 2011). Digital performance 
space relationships examined with time sensitive network analysis has increased the ability of research to 
characterise and make comment on processes, products, knowledge and know-how, and their complex 
entanglements in authentic performance settings. 
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3.3 Layers of Aggregations and Translations 

In the OECD assessment of CPS, aggregations of events into tasks takes place in a hierarchy that begins at 
the top with a scenario and ends within each task of the scenario at the level of a ‘turn’ - a game-based 
learning concept that updates the state of the scenario based on the learner’s input. 

Each problem scenario (unit) contains multiple tasks. A task, e.g., consensus building, is a particular 
phase within the scenario, with a beginning and an end. A task consists of a number of turns (exchanges, 
chats, actions, etc.) between the participants in the team. A finite number of options leading onto different 
paths are available to the participants after each turn, some of which constitute a step towards solving the 
problem. The end of a task forms an appropriate point to start the next task. Whenever the participants fail to 
reach this point a ‘rescue’ is programmed to ensure that the next task can be started (PISA, 2013). 

With this hierarchy in mind (e.g. scenarios containing tasks that contain turns) the challenge of 
aggregating with time sensitivity and translating from one level of analysis to another can be addressed with 
moving averages, sliding time windows, and event recognition. The OECD uses event recognition, in which 
an action, communication or product of the test taker triggers a reaction by the test engine to update the 
scenario, which might include rescuing the test taker. In a moving average, some window of time is selected 
(e.g. every second, or after every three turns) and an average is performed to form an abstracted data layer 
that preserves some of the shape of the data movement over time. In the sliding time window (Choi et ak, 
2010; Han, Cheng, Xin, & Yan, 2007), a combination of event recognitions and moving averages, or some 
configuration of either, might be performed and then used as an abstracted data layer. In the example case 
summarized below, for example, the time stamps of every action were subtracted from each other to compute 
duration, which was then applied to each action, to nearby action -pairs and to action -ngrams (motifs) for 
further analysis. 

Within any slice of time, or when comparing two or a few slices of time, standard statistical procedures 
and aggregations apply (e.g. means testing, correlations, regressions), but when high resolution data is 
involved (e.g. many data points per record per unit of time) and where there are complex aggregations (e.g. 
widely varying sources of data and different units of measure) then data mining techniques are more 
applicable. Of note, regression techniques in data mining are not equivalent to the same methods in statistics, 
even though the terms sound and look the same. In data mining, regression represents a search within a 
complex nonlinear space for patterns and representations of structure and causal dynamic relationships, rather 
than the reduction of error of a linear model (Schmidt & Lipson, 2009). Thus, aggregations in the two 
approaches are also of different lineage and need to be considered as separate entities with separate 
representational functions, meaning and purposes (Bates & Watts, 1988). 

3.4 Representations of Dynamics 

Systems dynamics (Bar-Yam, 1997; Sterman, 1994) involves a mathematical modeling technique for 
framing, understanding, and discussing the issues of time, digital performance space relationships and 
aggregation-translation in highly interactive technology-enhanced assessments. Eield experiments with 
systems dynamics methods have for example, focused on mid-level model-based theory building in an 
assessment context (Kopainsky, Pirnay-dummer, & Alessi, 2010). The process of building a model from 
snapshots of a dynamic system is called a ‘nonlinear state space reconstruction’ (Sugihara et ak, 2012). In 
such a state space equivalent to a network all the data falls within a finite band or manifold of behaviour. 
That is, every state of the system will be in one of the spaces created by the finite possibilities for each 
variable at some point in time. Such reconstructions of the underlying manifold governing the dynamics of a 
system can map to and uncover the causal relationships in a complex system (Schmidt & Lipson, 2009) 
including those that support inferences concerning what a user knows and can do. 

Visualizing the current status of a learner’s progress on an assessment is an example of representing a 
state of a dynamic system, as is visualizing the progress of the learner in relation to a domain model driving 
the assessment’s evidence collection processes. The Khan Academy (Khan, 2011) for example, charts 
progress in learning mathematics or science content against a visualization of the content hierarchy. If the 
learner has mastered division, a visual tree shows how mastery fits with addition and subtraction and allows 
access to the next higher level of math skill. More dynamic and fine-grained visualizations are also possible, 
for example, that would trace the process steps of a solution, or document the details of a constructive 
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process. Visualizations can aide pattern discovery involving both nonverbal and verbal expressions; for 
example, from bodies of text, from online student discussion forums, and from cognitive and mental model 
representations (Pirnay-Dummer et al., 2010). 

To date the developments in learning analytics that provide visualisations of learning traces for learners 
and teachers have been represented by learning analytics dashboards. Such dashboards have been developed 
that keep track of time, social interactions for insights into collaboration, the use of documents and tools, and 
the artefacts produced by students (Verbert, Duval, Klerkx, Govaerts, & Santos, 2013). While these 
dashboards currently fall far short of the detailed traces of assessment data that are possible to create, even 
these more limited opportunities for analysing their learning have been found to support learners' reflection 
and improve self-assessment as well as increasing course satisfaction (Verbert, et al., 2013). 

Examples of the more highly detailed traces are readily found in serious games, as well as casual games 
that are designed to be immersive and emotionally engaging rather than a simple pastime (Aldrich, 2005). In 
these game-based examples, the high -resolution feedback is always on, giving the player an up-to-date view 
of progress, hints about upcoming challenges, and a view to the longer-term goal (ftensky, 2001). Clearly 
educators and researchers might want to promote to policymakers the importance of researching the methods 
and impacts of presenting visualisations of data to teachers and learners along with developments in data 
processing that will better enable judgements of student performances. 

Perhaps the biggest unresolved issue of representation of collaborative learning (and perhaps any learning 
progress during a complex process) is how to represent the moving and evolving quality of change over time. 
‘Movies’ of dynamic educational processes have not yet been documented in many cases, and if existing, 
have not been widely disseminated into common practice. This lack of a practice base and experience 
hampers theory as well as practice in technology-enhanced assessments, and points to the need illustrated by 
the case in the next section, for future research and practice to create a shared understanding of the methods 
of data science in educational research. 


4. CASE STORY: VIRTUAL PERFORMANCE ASSESSMENT 

A case story illustrates how technology enabled educational assessment can produce a large number of 
records, how time and process can be an included mediating factor in analysis and how machine learning and 
data mining methods are needed to support the rapid simultaneous testing of multiple hypotheses. 

A game-based assessment of scientific thinking was created at Harvard (Clarke-Midura et al., 2012) and 
analysed by one of the authors (Gibson & Clarke-Midura, 2013) to ascertain the abilities of middle school 
students to design a scientific investigation and construct a causal explanation. A summary of the data 
science findings and issues included the observation of two of the three aspects of big data: volume 
(~ 821,000 records for 4000 subjects, or 205 records per subject); and variety of data (user actions, decisions 
and artifacts provided evidence of learning and thought processes). The third element of big data, velocity, 
was less important in this case; because the flow of data was not used in near -real time to give hints, correct 
mistakes, or inform the learner during the experience, so the data was streamed off to storage for later 
analysis. 

This case illustrates several of the features of big data in educational assessment. First, the context was 
captured along with the learner action, decision, and product, but that context needed to be effectively 
constructed from the smallest items of data into larger clusters of information. For example, a data element 
named ‘opened door’ by itself was relatively meaningless compared to knowing that it was a particular door, 
opened after another significant event such as talking to a scientist. Thus, patterns of action were transformed 
into n-grams (Scheffel, Niemann, & Leony, 2012) or motifs, which then became the transformed units of 
analysis. This concept of the unit of analysis containing the semantic, time and space contexts for lower 
levels of aggregation may be a new methodological requirement of digital assessments, and needs further 
study. 

Second, as a large number of users traverse through the network of possibilities in a digital performance 
space, key movements of the population within the network can be counted and then used as the basis for 
empirical prior probabilities which assist in creating Bayesian inferences about the scientific 
problem-solving path-maps of learners (Stevens, Johnson, & Soller, 2005). In particular, each pathway in 
such a network can be further characterized or specified with a predictive nonlinear mathematical 
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relationship (Gibson & Clarke-Midura, 2013), for example, found through symbolic regression an 
evolutionary machine learning technique (Schmidt & Lipson, 2009). Or, alternatively an association rule 
network can be created that distinguishes user action patterns and motifs according to the prevalence of 
utilizing one resource compared to another. For example, if 100% of the population goes to resource 3 after 
resource 1 (skipping over and not utilising resource 2), then with a very high probability, if the sample is a 
good sample of the greater population, the next user entering the system will follow that path and the 
inference system can make a highly probable educated guess about what the person now using resource 1 
will do next. 

The third feature is that the complex set of relationships in various analyses such as those just mentioned, 
bear a structural relationship to something meaningful about the digital performance space as outlined above. 
For example, a cluster analyses can reveal that some resources are critical to success and others are ignored 
and not important to the most successful learners (Quellmalz et ak, 2012) or a network visualization can 
highlight how people relate to each other or to a task such as quoting and using scientific resources (Bollen et 
ak, 2009). 


5. CONCLUSION AND IMPLICATIONS FOR TEACHING AND 
LEARNING 

This paper has introduced four challenges of big data in educational assessments that are enabled by 
technology: how to deal with change over time and time-based data; how a digital performance space’s 
relationships interact with learner actions, communications and products; how layers of interpretation are 
formed from translations of atomistic data into meaningful larger units; and how to represent the dynamics of 
interactions between and among learners who are being assessed by their interactions in digital performance 
spaces. The article linked the big data challenges to solutions offered in the OECD PISA assessment of 
collaborative problem solving, and then reviewed some of the same issues by briefly summarizing a 
particular case. 

The challenges and issues discussed in this article reveal the requirements for developments in theory as 
well as some of the practical challenges that will need to be overcome if educators are to achieve the vision 
of providing learners and teachers with a ‘quiet assessment’ system in which the impact can be turned up at 
the request of learners and teachers as they seek to understand the progress of learning. This joint approach 
which emphasises assessment AS, FOR and OF learning (Bennett, 2010) is discussed further in (Gibson & 
Webb, 2015) and (Webb & Gibson, 2015) 

In moving forward to embrace the opportunities that could be provided by technology enhanced 
assessments the challenges that remain to be addressed must not be underestimated before educators can use 
automated assessments of complex skills and understanding with confidence. 
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